Predicting Student Performance in Mathematics

A Data-Driven Approach to Academic Success

Group L21G05 | Lakshya Sakhuja (540831213), Mingyu Wang (540764541), Gary Zhang (520037603), Yujin Song (530538956)

2025-10-26

Slide 1: Dataset Overview

Dataset Overview

The Student Math Performance Dataset

  • Source: 395 secondary-school students from Portugal
  • Variables: 33 features (demographics, academic performance, lifestyle)
  • Target: Final math grade (G3) on a 0-20 scale

Goal

  • Predict final grades (G3) using early indicators
  • Identify key factors for academic success
  • Enable early intervention for at-risk students

Motivation

  • Math success shapes future learning, yet many students fall behind due to study habits, family background, and lifestyle factors. Early identification helps educators allocate resources effectively.

Research Question

  • “Can we predict final math performance from early grades, study habits, and lifestyle factors?”

Slide 2: Variables Overview

Variables Overview - Painting a picture of what we are working with

33 variables covering learning, lifestyle, and background

  • Demographics: gender, age, parental education, family support
  • Academic: grades from first (G1) and second (G2) periods leading to final (G3)
  • Study habits: weekly study time, past failures, absences
  • Lifestyle: weekday & weekend alcohol use, free time, social outings
  • “Together, these factors paint a 360° picture of each student’s life — both inside and outside the classroom.”

All Variables:

Category   Variables
School     school
Student    sex, age
Family     address, famsize, Pstatus, Medu, Fedu, Mjob, Fjob, reason, guardian, traveltime
Academic   studytime, failures, schoolsup, famsup, paid, activities, nursery, higher, internet
Lifestyle  romantic, famrel, freetime, goout, Dalc, Walc, health, absences
Grades     G1, G2, G3 (Target)

Slide 3: Variables of Interest

Variables of Interest - Descriptive Statistics

Variable                          Min   Max   Median   Mean    SD
Age (years)                        15    22       17   16.7   1.3
G1 (first-period grade)             3    19       11   10.9   3.3
G2 (second-period grade)            0    19       11   10.7   3.8
G3 (final grade) - TARGET           0    20       11   10.4   4.6
Absences (count)                    0    75        4    5.7   8.0
Going out (1-5 scale)               1     5        3    3.1   1.1
Mother's education (0-4 scale)      0     4        3    2.8   1.1
Father's education (0-4 scale)      0     4        2    2.5   1.1
Weekend alcohol (1-5 scale)         1     5        2    2.3   1.3
Study time (1-4 scale)              1     4        2    2.0   0.8
Weekday alcohol (1-5 scale)         1     5        1    1.5   0.9
Past failures (count)               0     3        0    0.3   0.7

Result: No missing data - clean dataset ready for analysis!

Slide 4: Demographics Snapshot

Demographics Snapshot - Student Population Overview

Slide 5: Key Relationships

Key Relationships - Academic Behavior, Lifestyle Factors & Grade Progression

Key Correlations:

  1. Study time: 0.098
  2. Weekend alcohol: -0.052
  3. G1 vs G3: 0.801
  4. G2 vs G3: 0.905

Key insight: G2 shows the strongest correlation with G3 (0.905) - strong potential for early prediction!
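These figures are plain Pearson correlations; a minimal sketch of the computation (the toy grade vectors below are hypothetical, not rows from the dataset):

```python
import numpy as np

def pearson(x, y):
    """Pearson correlation coefficient between two equal-length vectors."""
    x, y = np.asarray(x, float), np.asarray(y, float)
    xc, yc = x - x.mean(), y - y.mean()
    return float(xc @ yc / np.sqrt((xc @ xc) * (yc @ yc)))

# Toy vectors standing in for the G2 and G3 columns (hypothetical values)
g2 = [10, 12, 8, 15, 11, 9]
g3 = [11, 13, 7, 16, 10, 9]
print(round(pearson(g2, g3), 3))
```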

Slide 6: Correlation Heatmap

Correlation Heatmap - G2 Dominates Correlation Structure

Top G3 correlations:
          g2           g1        ln_g1        ln_g2 g2_per_study 
       0.905        0.801        0.794        0.784        0.528 

Slide 7: Interactive EDA Dashboard

Explore Predictors Dynamically

Features: Hover, zoom, pan, rotate

Insights: G1-G3-study time relationship | Gender differences in performance

Slide 8: Model Selection Approach

Multiple Linear Regression (MLR)

Model Framework:

G3 = β₀ + β₁(ln_absences) + β₂(G2) + β₃(fail_abs_ratio) + β₄(absences_per_study) + ε

Selection Process:

  1. Stepwise selection (backward/forward) for initial screening
  2. Exhaustive search (leaps/lmSubsets) for the optimal subset
  3. Out-of-sample validation using cross-validation (CV RMSE) alongside AIC/BIC
  4. Final selection based on the best balance of fit and simplicity

Hypothesis:

Academic performance (G2), absence patterns, and failure-absence relationships predict final grades with statistical significance.

Why MLR?

  • Interpretability: Coefficients directly show predictor impact on grades
  • Hypothesis testing: p-values validate statistical significance of predictors
  • Linearity assumption: G3 shows approximately linear relationships with predictors
  • Performance: Achieved best balance of R², RMSE, and model simplicity compared to alternatives
  • Generalizability: Feature engineering (ratios, log transforms) captures non-linear patterns within linear framework
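The engineered predictors named in the framework can be sketched as follows; the exact definitions (log(1 + x) for ln_absences, the +1 smoothing in fail_abs_ratio) are assumptions, since the slides report only the feature names:

```python
import numpy as np

def engineer_features(absences, failures, studytime):
    """Sketch of the engineered predictors; the exact definitions
    (log(1 + x), +1 smoothing) are assumptions."""
    absences = np.asarray(absences, float)
    failures = np.asarray(failures, float)
    studytime = np.asarray(studytime, float)
    ln_absences = np.log1p(absences)            # tames the long right tail (absences max out at 75)
    fail_abs_ratio = failures / (absences + 1)  # failures relative to attendance; +1 avoids /0
    absences_per_study = absences / studytime   # studytime is on a 1-4 scale, never 0
    return ln_absences, fail_abs_ratio, absences_per_study

# Three hypothetical students
ln_a, far, aps = engineer_features(absences=[0, 4, 10], failures=[0, 1, 3], studytime=[2, 1, 2])
print(ln_a, far, aps)
```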

Slide 9: Model Assumptions & Diagnostics

Checking MLR Assumptions

Validation: Linearity: no residual pattern | Homoscedasticity: roughly constant variance | Normality: approximately normal residuals | Influence: no high-leverage outliers | No multicollinearity: VIF < 5

Slide 10: Model Assumptions & Diagnostics (Continued)

Assumptions and Limitations


Assumption Checks
1. Linearity: Residuals vs Fitted plot shows no clear pattern
2. Homoscedasticity: Scale-Location plot shows roughly constant variance
3. Normality: Q-Q plot shows residuals are approximately normal
4. Influential points: Residuals vs Leverage shows no influential outliers

=== Formal Statistical Tests ===
1. Shapiro-Wilk: W = 0.802 , p = 1.70e-19 
2. Breusch-Pagan: BP = 41.7 , p = 0.0000 
3. VIF: ln_absences=2.47, g2=1.18, fail_abs_ratio=1.27, absences_per_study=2.35 
4. Durbin-Watson: DW = 1.98 , p = 0.7580 

=== Summary ===
No multicollinearity (VIF < 5)
Independence passed (Durbin-Watson p = 0.758)
Normality and homoscedasticity are rejected by the formal tests (Shapiro-Wilk and Breusch-Pagan, both p < 0.001), so p-values should be read with some caution
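The VIF figures above come from regressing each predictor on the others; a self-contained sketch of that computation (the random design matrix is purely illustrative):

```python
import numpy as np

def vif(X):
    """Variance inflation factors: VIF_j = 1 / (1 - R^2_j), where R^2_j
    comes from regressing column j of X on the remaining columns (with
    an intercept). VIF < 5 is the rule of thumb used on the slide."""
    X = np.asarray(X, float)
    n, p = X.shape
    factors = []
    for j in range(p):
        y = X[:, j]
        A = np.column_stack([np.ones(n), np.delete(X, j, axis=1)])
        beta, *_ = np.linalg.lstsq(A, y, rcond=None)
        ss_res = ((y - A @ beta) ** 2).sum()
        ss_tot = ((y - y.mean()) ** 2).sum()
        factors.append(float(ss_tot / ss_res))  # equals 1 / (1 - R^2_j)
    return factors

# Illustration: independent random predictors should give VIFs near 1
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 3))
print([round(v, 2) for v in vif(X)])
```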

Limitations:

  • Sample size (395 students) limits generalizability
  • Single school context (Portugal) - may not extend to other regions
  • Cross-sectional design - associations only, no causality
  • Linear relationships assumed (non-linear patterns possible)

Slide 11: Backward Stepwise Selection

Starting from Full Model → Removing Variables

Backward Stepwise Selection Results:

Call:
lm(formula = g3 ~ failures + ln_failures + ln_absences + g2 + 
    fail_abs_ratio + absences_per_study, data = train)

Residuals:
    Min      1Q  Median      3Q     Max 
-9.4004 -0.4335  0.0963  0.8690  4.3071 

Coefficients:
                   Estimate Std. Error t value Pr(>|t|)    
(Intercept)        -0.97093    0.36603  -2.653 0.008393 ** 
failures            3.07158    0.74828   4.105 5.16e-05 ***
ln_failures        -5.23511    1.36914  -3.824 0.000159 ***
ln_absences         0.51006    0.14728   3.463 0.000608 ***
g2                  1.03713    0.02782  37.286  < 2e-16 ***
fail_abs_ratio     -1.54255    0.31785  -4.853 1.92e-06 ***
absences_per_study -0.04722    0.02893  -1.633 0.103573    
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 1.688 on 314 degrees of freedom
Multiple R-squared:  0.8616,    Adjusted R-squared:  0.859 
F-statistic: 325.8 on 6 and 314 DF,  p-value: < 2.2e-16

Model formula (backward):
g3 ~ failures + ln_failures + ln_absences + g2 + fail_abs_ratio + absences_per_study

Slide 12: Forward Stepwise Selection

Starting from Null Model → Adding Variables

Forward Stepwise Selection Results:

Call:
lm(formula = g3 ~ g2 + fail_abs_ratio + ln_absences + absences_per_study + 
    failures + ln_failures, data = train)

Residuals:
    Min      1Q  Median      3Q     Max 
-9.4004 -0.4335  0.0963  0.8690  4.3071 

Coefficients:
                   Estimate Std. Error t value Pr(>|t|)    
(Intercept)        -0.97093    0.36603  -2.653 0.008393 ** 
g2                  1.03713    0.02782  37.286  < 2e-16 ***
fail_abs_ratio     -1.54255    0.31785  -4.853 1.92e-06 ***
ln_absences         0.51006    0.14728   3.463 0.000608 ***
absences_per_study -0.04722    0.02893  -1.633 0.103573    
failures            3.07158    0.74828   4.105 5.16e-05 ***
ln_failures        -5.23511    1.36914  -3.824 0.000159 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 1.688 on 314 degrees of freedom
Multiple R-squared:  0.8616,    Adjusted R-squared:  0.859 
F-statistic: 325.8 on 6 and 314 DF,  p-value: < 2.2e-16

Model formula (forward):
g3 ~ g2 + fail_abs_ratio + ln_absences + absences_per_study + failures + ln_failures

Slide 13: Exhaustive Search (Leaps)

Exhaustive Search (Leaps) - Finding the Optimal Subset

Leaps Exhaustive Search Results:
Best size by BIC: 5 
Best size by AdjR2: 6 

Call:
lm(formula = form_bic, data = train)

Residuals:
    Min      1Q  Median      3Q     Max 
-9.4810 -0.4391  0.1444  0.8975  4.2464 

Coefficients:
               Estimate Std. Error t value Pr(>|t|)    
(Intercept)    -0.93790    0.36643  -2.560  0.01095 *  
failures        3.21139    0.74532   4.309 2.20e-05 ***
ln_failures    -5.50811    1.36247  -4.043 6.65e-05 ***
ln_absences     0.33685    0.10242   3.289  0.00112 ** 
g2              1.04189    0.02773  37.566  < 2e-16 ***
fail_abs_ratio -1.56128    0.31848  -4.902 1.52e-06 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 1.692 on 315 degrees of freedom
Multiple R-squared:  0.8604,    Adjusted R-squared:  0.8582 
F-statistic: 388.4 on 5 and 315 DF,  p-value: < 2.2e-16

Slide 14: lmSubsets Analysis

Slide 15: Model Comparison

Candidate Model (Leaps) vs Candidate Model (lmSubsets)

      Model Variables    R2 AdjR2    AIC    BIC
1     Leaps         4 0.854 0.852 1269.0 1291.7
2 lmSubsets         4 0.853 0.851 1270.9 1293.5

=== Performance Metrics ===
Leaps model RMSE: 1.71 
lmSubsets model RMSE: 1.72 

=== Model Formulas ===
Leaps: G3 ~ ln_absences + G2 + fail_abs_ratio + absences_per_study
lmSubsets: G3 ~ failures + ln_absences + G2 + fail_abs_ratio

Key Difference: the lmSubsets model retains failures directly, while the Leaps model uses only engineered features

Winner: Leaps model selected!

  • Lower BIC (1291.7 vs 1293.5) - better parsimony
  • Higher adjusted R² (0.852 vs 0.851) - better fit
  • Engineered features are more interpretable than raw failures

Rationale: Feature engineering (ratios) captures relationships better than raw variables.
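The AIC/BIC values driving this comparison can be recomputed from a model's residual sum of squares under the usual Gaussian-likelihood convention; a sketch with hypothetical inputs (the RSS below is illustrative, not the fitted value):

```python
import numpy as np

def aic_bic(rss, n, k):
    """AIC and BIC for a least-squares linear model, computed from the
    residual sum of squares under the Gaussian likelihood. k counts the
    regression coefficients (intercept included); one extra parameter is
    added for the error variance, matching R's AIC()/BIC() convention."""
    log_lik = -0.5 * n * (np.log(2 * np.pi) + np.log(rss / n) + 1)
    p = k + 1  # + 1 for the estimated error variance
    aic = 2 * p - 2 * log_lik
    bic = np.log(n) * p - 2 * log_lik
    return float(aic), float(bic)

# Hypothetical inputs: 321 training rows, a 5-coefficient fit, illustrative RSS
aic, bic = aic_bic(rss=900.0, n=321, k=5)
print(round(aic, 1), round(bic, 1))
```

BIC penalizes each parameter by log(n) rather than 2, which is why it favors the smaller Leaps model here.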

Slide 16: K-Fold Cross-Validation

Final Model Validation

                                        Model  RMSE Rsquared  MAE
Leaps Candidate               Leaps Candidate 1.802    0.844 1.13
lmSubsets Candidate       lmSubsets Candidate 1.812    0.849 1.13

Best Model (Lowest RMSE): Leaps Candidate 
CV RMSE: 1.802 
CV R²: 0.844 

=== Cross-Validation Summary ===
Method: 10-fold CV
Purpose: Assess out-of-sample predictive performance
Metric: RMSE (Root Mean Squared Error) - Lower is better

Winner: Leaps Candidate 
Performance improvement: 0.01 RMSE units
Interpretation: Leaps Candidate generalizes better to unseen data
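The 10-fold procedure can be sketched as below; note this pools squared errors across folds rather than averaging per-fold RMSE as caret does, and the toy data are hypothetical:

```python
import numpy as np

def kfold_rmse(X, y, k=10, seed=0):
    """K-fold cross-validated RMSE for an OLS fit with intercept.
    Squared errors are pooled across folds (caret instead averages
    per-fold RMSE, which differs slightly)."""
    X, y = np.asarray(X, float), np.asarray(y, float)
    n = len(y)
    folds = np.array_split(np.random.default_rng(seed).permutation(n), k)
    sq_errs = []
    for fold in folds:
        test = np.zeros(n, dtype=bool)
        test[fold] = True
        A_train = np.column_stack([np.ones((~test).sum()), X[~test]])
        beta, *_ = np.linalg.lstsq(A_train, y[~test], rcond=None)
        A_test = np.column_stack([np.ones(test.sum()), X[test]])
        sq_errs.extend((y[test] - A_test @ beta) ** 2)
    return float(np.sqrt(np.mean(sq_errs)))

# Toy data: one predictor, noise sd 0.3, so the CV RMSE should land near 0.3
rng = np.random.default_rng(1)
x = rng.normal(size=(120, 1))
y = 2.0 + 1.5 * x[:, 0] + 0.3 * rng.normal(size=120)
print(round(kfold_rmse(x, y), 3))
```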

Slide 17: Final Model Summary

Leaps Model

Equation: G3 ~ ln_absences + G2 + fail_abs_ratio + absences_per_study

Model Equation:
G3 = -1.373 + 0.686 × ln_absences + 1.046 × G2 - 0.908 × fail_abs_ratio - 0.067 × absences_per_study
Performance Metrics:
R² = 0.845 
Adj R² = 0.844 
RMSE = 1.8 

Key Coefficients:
ln_absences : 0.686 (p = 7.31e-07 )
g2 : 1.046 (p = 1.24e-139 )
fail_abs_ratio : -0.908 (p = 1.92e-05 )
absences_per_study : -0.067 (p = 1.35e-02 )

Model explains 85% of variance with 4 key engineered features!
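Plugging the reported coefficients into the equation gives a simple scorer; the engineered-feature definitions used here (log(1 + absences), +1 smoothing in the ratio) are assumptions, since only the coefficients appear on the slide:

```python
from math import log1p

# Coefficients of the selected Leaps model, copied from the slide
B0, B_LNABS, B_G2, B_FAR, B_APS = -1.373, 0.686, 1.046, -0.908, -0.067

def predict_g3(absences, g2, studytime, failures):
    """Score a student with the reported equation; the feature
    definitions (log(1 + x), +1 smoothing) are assumptions."""
    ln_absences = log1p(absences)
    fail_abs_ratio = failures / (absences + 1)
    absences_per_study = absences / studytime  # studytime is on a 1-4 scale, never 0
    g3 = (B0 + B_LNABS * ln_absences + B_G2 * g2
          + B_FAR * fail_abs_ratio + B_APS * absences_per_study)
    return max(0.0, min(20.0, g3))  # grades live on a 0-20 scale

# Hypothetical student: 4 absences, G2 = 12, studytime level 2, no past failures
print(round(predict_g3(absences=4, g2=12, studytime=2, failures=0), 1))
```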

Slide 18: Prediction vs Test Performance

Out-of-Sample Validation with 95% Prediction Intervals

Test Set Performance:
R² = 0.812 
RMSE = 2.12 
MAE = 1.21 
Empirical coverage of the 95% prediction intervals on the test set: 91.9%

Slide 19: Discussion of Results

Key Findings & Interpretations

Key Coefficients & Interpretations:

  1. G2 (β = 1.038, p < 0.001)
    • Strongest predictor of final performance
    • Each 1-point increase in G2 → 1.04 point increase in G3
    • Early assessment (G2) highly predictive of final outcome
  2. fail_abs_ratio (β = -1.063, p < 0.001)
    • High failure-to-absence ratio → lower performance
    • Students who fail despite fewer absences struggle more
  3. ln_absences (β = 0.578, p < 0.001)
    • Log-transformed absences are positively associated with grades after controlling for the ratio features
    • Suggests attendance alone doesn’t determine success
  4. absences_per_study (β = -0.060, p = 0.042)
    • Absences relative to study time are negatively associated with performance
    • Students who miss many classes relative to their study time perform worse

Model Performance: R² = 0.85, RMSE = 1.72

Limitations:

  • Single school dataset (Portugal) - limited generalizability
  • Cross-sectional design - no causal inference
  • Sample size (395 students) may limit power

Future Research:

  • Non-linear models (Random Forest, XGBoost)
  • Longitudinal tracking
  • Multi-school validation

Thank You!

Questions? The floor is now open.

We welcome questions from the tutors, professors, students, and all attendees.

Contact: Group L21G05

GitHub: https://github.sydney.edu.au/mwan0680/L21G05-GROUP

Code & Analysis: Fully reproducible in Quarto